library(CelltypeR)
Read in the flow data. This data should be the gated live cells.
All samples need to be in one folder.
input_path <- "/Users/rhalenathomas/Documents/Data/FlowCytometry/PhenoID/FlowDataFiles/9MBO"
output_path <- "/Users/rhalenathomas/Documents/Data/FlowCytometry/PhenoID/Analysis/testingLibrary/"
# 1.a Read in FlowJo Files
# in Figure 3 we downsampled to 9000 cells where possible
# here we downsample to 1000 to speed up computation
flowset <- fsc_to_fs(input_path, downsample = 1000)
# downsample can be a number, 'none', or 'min'
# look at the file names and rename with shorter sample names
sampleNames(flowset)
[1] "2020-03-06- export_bioinfo_3450c_live cells.fcs"
[2] "2020-03-06- export_bioinfo_AIW_live cells.fcs"
[3] "2020-03-06- export_bioinfo_AJG_live cells.fcs"
[4] "2020-03-17- export_bioinfo_old 3450c_live cells.fcs"
[5] "2020-03-17- export_bioinfo_old AIW_live cells.fcs"
[6] "2020-03-17- export_bioinfo_old AJG_live cells.fcs"
[7] "2020-03-17- export_bioinfo_young 3450c_live cells.fcs"
[8] "2020-03-17- export_bioinfo_young AIW_live cells.fcs"
[9] "2020-03-17- export_bioinfo_young AJG_live cells.fcs"
sampleNames(flowset) <- c("3450_0306","AIW002_0306","AJG001C_0306","3450_0317A","AIW002_0317A","AJG001C_0317A","3450_0317B","AIW002_0317B","AJG001C_0317B")
sampleNames(flowset)
[1] "3450_0306" "AIW002_0306" "AJG001C_0306"
[4] "3450_0317A" "AIW002_0317A" "AJG001C_0317A"
[7] "3450_0317B" "AIW002_0317B" "AJG001C_0317B"
# if desired, the flowset object can be saved now
# and read back in later to continue the workflow
Harmonize the data to remove batch and technical variation.
This requires inspecting the markers to see which have two peaks to align. We need to visualize the peaks of the transformed data before aligning.
# we can decide what level of processing to apply with the argument 'processing'
# biexp only applies a biexponential transformation
# align applies the biexp transformation and then aligns peaks
# retro (default) transforms, aligns, and then reverse-transforms
flowset_biexp <- harmonize(flowset, processing = 'biexp')
# visualize the densities to see which markers have two peaks to align
# enter the column indices of the peaks to align; one-peak and two-peak markers are aligned differently
#plotdensity_flowset(flowset)
#plotdensity_flowset(flowset_biexp) # to see the peaks
flowset_align <- harmonize(flowset, processing = 'align',
two_peaks = c(7:20),
one_peak = c(1:6,21), threshold = 0.01)
Adjusting the distance between landmarks
.........
Adjusting the distance between landmarks
.........
flowset_retro <- harmonize(flowset, processing = 'retro',
two_peaks = c(7:20),
one_peak = c(1:6,21), threshold = 0.01)
Adjusting the distance between landmarks
.........
Adjusting the distance between landmarks
.........
df <- flowset_to_csv(flowset_retro)
Now that we have processed the fcs files in each of the different ways, we can visualize the intensities in cell density plots to check the alignment.
#plotdensity_flowset(flowset)
plotdensity_flowset(flowset_biexp)
Warning in melt(lapply(as.list(flowset@frames), function(x) { :
The melt generic in data.table has been passed a list and will attempt to redirect to the relevant reshape2 method; please note that reshape2 is deprecated, and this redirection is now deprecated as well.
No id variables; using all as measure variables (printed once per sample)
Warning in geom_density_ridges(alpha = 0.4, verbose = FALSE) :
Ignoring unknown parameters: `verbose`
Picking joint bandwidth of 0.0779 (one bandwidth is printed per marker panel)
plotdensity_flowset(flowset_align)
(the same melt and geom_density_ridges warnings are printed as above)
Picking joint bandwidth of 0.0779 (one bandwidth is printed per marker panel)
#plotdensity_flowset(flowset_retro)
# the function make_seu takes the expression data frame and the antibody/marker list as a vector and creates a Seurat object
# values are scaled; marker expression is stored in the "RNA" slot and PCA is calculated using the AB vector as the features
# always keep the same antibody order or your labels will not be correct
# antibody features, in the order they will appear on the plots
AB <- c("CD24","CD56","CD29","CD15","CD184","CD133","CD71","CD44","GLAST","AQP4","HepaCAM", "CD140a","O4")
seu <- make_seu(df, AB_vector = AB)
Centering and scaling data matrix
  |========================================================| 100%
Warning in irlba(A = t(x = object), nv = npcs, ...) :
You're computing too large a percentage of total singular values, use a standard svd instead.
Warning: Requested number is larger than the number of available items (13). Setting to 13.
(this warning is printed five times)
PC_ 1
Positive: CD44, CD184, CD15, CD24, CD56, CD133
Negative: CD29, CD140a, O4, CD71, AQP4, HepaCAM
PC_ 2
Positive: CD15, CD56, GLAST, CD140a, AQP4, O4
Negative: CD29, CD44, CD184, CD133, CD24, HepaCAM
PC_ 3
Positive: CD56, CD29, CD133, GLAST, AQP4, CD71
Negative: CD44, CD184, CD140a, O4, HepaCAM, CD15
PC_ 4
Positive: CD56, CD24, CD29, CD15, CD140a, CD184
Negative: CD133, AQP4, HepaCAM, CD71, O4, GLAST
PC_ 5
Positive: CD184, CD24, CD133, O4, CD140a, HepaCAM
Negative: CD44, CD71, CD56, CD15, AQP4, GLAST
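Conceptually, the scaling and PCA that make_seu performs correspond to standard operations; the following base-R sketch is an assumption about those steps (not CelltypeR's actual implementation), using a random stand-in expression matrix:

```r
# Stand-in expression matrix: 200 cells x the 13 antibody markers
set.seed(42)
AB <- c("CD24","CD56","CD29","CD15","CD184","CD133","CD71","CD44",
        "GLAST","AQP4","HepaCAM","CD140a","O4")
expr <- matrix(rnorm(200 * length(AB)), ncol = length(AB),
               dimnames = list(NULL, AB))
scaled <- scale(expr)                  # centre and scale each marker
pca <- prcomp(scaled, center = FALSE)  # PCA using the antibody panel as features
head(pca$x[, 1:2])                     # per-cell coordinates on PC1 and PC2
```

Note that with only 13 features there can be at most 13 principal components, which is why the warnings above report the requested number being reset to 13.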
You can save the data frame and Seurat object for later.
# set a path to where you want to save the data
output_path <- "/User/output/flow/"
# to save the df for later
write.csv(df, paste(output_path,"df1000.csv", sep = ""))
# save the seurat object
saveRDS(seu, paste(output_path, "seuratObject1000.Rds", sep = ""))
Read in the csv of the processed flow files and the Seurat object.
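The two files can be written and read back as sketched below, using a temporary directory and a stand-in data frame rather than the real objects (with the real data, substitute output_path, df, and seu; saveRDS files conventionally use an .Rds extension):

```r
# Round-trip sketch: write a csv and an RDS file, then reload them
out <- tempdir()
df_demo <- data.frame(CD24 = rnorm(5), CD56 = rnorm(5))   # stand-in for df
write.csv(df_demo, file.path(out, "df1000.csv"))
saveRDS(df_demo, file.path(out, "seuratObject1000.Rds"))  # saveRDS also works for a Seurat object
df_back  <- read.csv(file.path(out, "df1000.csv"), row.names = 1)
obj_back <- readRDS(file.path(out, "seuratObject1000.Rds"))
```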
To run the intrinsic statistics we need the data frame; the statistics apply to all clustering methods.
1. Silhouette score: ranges from -1 to 1; a value near -1 indicates poor-quality clusters and a value near 1 indicates good-quality clusters.
2. Calinski-Harabasz index: higher values indicate better-quality clusters.
3. Davies-Bouldin index: lower values indicate better clusters; the minimum value is 0.
Note that even with only 1000 cells per sample these computations are slow; to test larger data sets, use base R scripts and an HPC system.
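To see how these three statistics behave, here is a self-contained base-R sketch with illustrative implementations (not the ones CelltypeR calls), applied to two well-separated toy clusters:

```r
# Toy data: two well-separated clusters of 50 cells each in two dimensions
set.seed(1)
x  <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 4), ncol = 2))
cl <- rep(1:2, each = 50)

# Mean silhouette score: (b - a) / max(a, b), averaged over cells
silhouette_mean <- function(x, cl) {
  d <- as.matrix(dist(x))
  mean(sapply(seq_along(cl), function(i) {
    same <- cl == cl[i]; same[i] <- FALSE
    a <- mean(d[i, same])                        # mean distance within own cluster
    b <- min(sapply(setdiff(unique(cl), cl[i]),  # nearest other cluster
                    function(g) mean(d[i, cl == g])))
    (b - a) / max(a, b)
  }))
}

# Calinski-Harabasz index: between- vs within-cluster dispersion (higher is better)
calinski_harabasz <- function(x, cl) {
  n <- nrow(x); k <- length(unique(cl)); grand <- colMeans(x)
  ssw <- 0; ssb <- 0
  for (g in unique(cl)) {
    xg <- x[cl == g, , drop = FALSE]; mg <- colMeans(xg)
    ssw <- ssw + sum(sweep(xg, 2, mg)^2)
    ssb <- ssb + nrow(xg) * sum((mg - grand)^2)
  }
  (ssb / (k - 1)) / (ssw / (n - k))
}

# Davies-Bouldin index: average worst-case cluster overlap (lower is better, min 0)
davies_bouldin <- function(x, cl) {
  gs <- unique(cl); k <- length(gs)
  cent <- t(sapply(gs, function(g) colMeans(x[cl == g, , drop = FALSE])))
  s <- sapply(seq_len(k), function(i) {
    xg <- x[cl == gs[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(xg, 2, cent[i, ])^2)))  # mean distance to centroid
  })
  mean(sapply(seq_len(k), function(i) {
    max(sapply(seq_len(k)[-i], function(j)
      (s[i] + s[j]) / sqrt(sum((cent[i, ] - cent[j, ])^2))))
  }))
}

silhouette_mean(x, cl)    # near 1 for well-separated clusters
calinski_harabasz(x, cl)  # large for well-separated clusters
davies_bouldin(x, cl)     # small for well-separated clusters
```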
Test Flowsom
# you need to load clustree for this to work right now
library(clustree)
flow.test <- explore_param(input = seu,
cluster_method = 'flowsom',
df_input = df,
flow_k = c(5,10,15,20),
pheno_lou_kn = NULL,
lou_resolution = NULL,
run.plot = TRUE,
run.stats = TRUE,
save_to = NULL)
Building SOM
Mapping data to SOM
Building MST
20:02:21 UMAP embedding parameters a = 0.9922 b = 1.112
20:02:21 Read 9000 rows and found 12 numeric columns
20:02:21 Using Annoy for neighbor search, n_neighbors = 30
20:02:21 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:02:21 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d7eaf700b
20:02:21 Searching Annoy index using 1 thread, search_k = 3000
20:02:23 Annoy recall = 100%
20:02:24 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
20:02:24 Initializing from normalized Laplacian + noise (using irlba)
20:02:25 Commencing optimization for 500 epochs, with 383272 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:02:39 Optimization finished
[1] "Clustree"
Notice in the UMAPs and heatmaps that a large number of cells fall into one or two clusters, with only small numbers of cells going into the remaining clusters. This is not useful for labelling cell types.
# Look at the intrinsic statistics
flow.test[[1]]
(table of intrinsic statistics)
The statistics indicate that the lowest k value gives the best-quality clusters.
Test Phenograph clustering
pheno.test <- explore_param(seu, #for phenograph and louvain only
cluster_method = 'phenograph', #take 1 cluster method
df_input = df, # needed only when cluster_method is "flowsom"
flow_k = NULL, #k for flowsom
pheno_lou_kn = c(150,300,500), #kn for phenograph or louvain
lou_resolution = NULL, #resolution for louvain
run.plot = TRUE, #print if run
run.stats = TRUE, #print and return if run
save_to = NULL)
20:25:24 UMAP embedding parameters a = 0.9922 b = 1.112
20:25:24 Read 9000 rows and found 12 numeric columns
20:25:24 Using Annoy for neighbor search, n_neighbors = 150
20:25:24 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:25:25 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d20bc16b5
20:25:25 Searching Annoy index using 1 thread, search_k = 15000
20:25:34 Annoy recall = 100%
20:25:35 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 150
20:25:37 Initializing from normalized Laplacian + noise (using irlba)
20:25:37 Commencing optimization for 500 epochs, with 1616942 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:26:01 Optimization finished
Run Rphenograph starts:
-Input data of 9000 rows and 13 columns
-k is set to 150
Finding nearest neighbors...DONE ~ 0.82 s
Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 38.237 s
Build undirected graph from the weighted links...DONE ~ 6.445 s
Run louvain clustering on the graph ...DONE ~ 2.332 s
Run Rphenograph DONE, totally takes 47.8340000000003s.
Return a community class
-Modularity value: 0.7695235
-Number of clusters: 11
20:26:55 UMAP embedding parameters a = 0.9922 b = 1.112
20:26:55 Read 9000 rows and found 12 numeric columns
20:26:55 Using Annoy for neighbor search, n_neighbors = 300
20:26:55 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:26:56 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d1c76d5dd
20:26:56 Searching Annoy index using 1 thread, search_k = 30000
20:27:13 Annoy recall = 100%
20:27:13 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 300
20:27:16 Initializing from normalized Laplacian + noise (using irlba)
20:27:17 Commencing optimization for 500 epochs, with 2272820 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:27:44 Optimization finished
Run Rphenograph starts:
-Input data of 9000 rows and 13 columns
-k is set to 300
Finding nearest neighbors...DONE ~ 1.344 s
Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 155.13 s
Build undirected graph from the weighted links...DONE ~ 13.209 s
Run louvain clustering on the graph ...DONE ~ 5.086 s
Run Rphenograph DONE, totally takes 174.769s.
Return a community class
-Modularity value: 0.7260979
-Number of clusters: 9
20:30:46 UMAP embedding parameters a = 0.9922 b = 1.112
20:30:46 Read 9000 rows and found 12 numeric columns
20:30:46 Using Annoy for neighbor search, n_neighbors = 500
20:30:46 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:30:47 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d3dd2aec4
20:30:47 Searching Annoy index using 1 thread, search_k = 50000
20:31:15 Annoy recall = 100%
20:31:15 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 500
20:31:21 Initializing from normalized Laplacian + noise (using irlba)
20:31:22 Commencing optimization for 500 epochs, with 2593020 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:31:52 Optimization finished
Run Rphenograph starts:
-Input data of 9000 rows and 13 columns
-k is set to 500
Finding nearest neighbors...DONE ~ 2.199 s
Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 432.167 s
Build undirected graph from the weighted links...DONE ~ 22.728 s
Run louvain clustering on the graph ...DONE ~ 7.992 s
Run Rphenograph DONE, totally takes 465.086s.
Return a community class
-Modularity value: 0.6808627
-Number of clusters: 9
# the clustree plot is not meaningful for Phenograph, but it is included for completeness
# look at the intrinsic stats table
pheno.test[[1]]
(table of intrinsic statistics)
The silhouette score and Davies-Bouldin index are best for kn = 500, but the Calinski-Harabasz index is best for kn = 300. The statistics are contradictory.
Test Louvain clustering
# for best clustree visualization include resolution 0
lou.test <- explore_param(input = seu, #for phenograph and louvain only
cluster_method = "louvain",
df_input = df,
flow_k = NULL,
pheno_lou_kn = c(60, 200),
lou_resolution = c(0,0.25,0.5,0.8),
run.plot = TRUE,
run.stats = TRUE,
save_to = NULL)
[1] "finding neighbours for kn 60"
Computing nearest neighbor graph
Computing SNN
20:44:11 UMAP embedding parameters a = 0.9922 b = 1.112
20:44:11 Read 9000 rows and found 12 numeric columns
20:44:11 Using Annoy for neighbor search, n_neighbors = 60
20:44:11 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:44:12 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d4ee8bd10
20:44:12 Searching Annoy index using 1 thread, search_k = 6000
20:44:16 Annoy recall = 100%
20:44:16 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 60
20:44:17 Initializing from normalized Laplacian + noise (using irlba)
20:44:18 Commencing optimization for 500 epochs, with 756780 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:44:35 Optimization finished
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 1.0000
Number of communities: 1
Elapsed time: 2 seconds
[1] "completed louvain kn 60 resolution 0"
[1] 9000
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8663
Number of communities: 7
Elapsed time: 2 seconds
[1] "completed louvain kn 60 resolution 0.25"
[1] 9000
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8202
Number of communities: 10
Elapsed time: 2 seconds
[1] "completed louvain kn 60 resolution 0.5"
[1] 9000
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7847
Number of communities: 12
Elapsed time: 2 seconds
[1] "completed louvain kn 60 resolution 0.8"
[1] 9000
[1] "finding neighbours for kn 200"
Computing nearest neighbor graph
Computing SNN
20:45:15 UMAP embedding parameters a = 0.9922 b = 1.112
20:45:15 Read 9000 rows and found 12 numeric columns
20:45:15 Using Annoy for neighbor search, n_neighbors = 200
20:45:15 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:45:16 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d11d48de9
20:45:16 Searching Annoy index using 1 thread, search_k = 20000
20:45:27 Annoy recall = 100%
20:45:28 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 200
20:45:30 Initializing from normalized Laplacian + noise (using irlba)
20:45:31 Commencing optimization for 500 epochs, with 1915956 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:45:56 Optimization finished
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 3086396
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 1.0000
Number of communities: 1
Elapsed time: 6 seconds
[1] "completed louvain kn 200 resolution 0"
[1] 9000
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 3086396
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8139
Number of communities: 5
Elapsed time: 11 seconds
[1] "completed louvain kn 200 resolution 0.25"
[1] 9000
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 3086396
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7421
Number of communities: 7
Elapsed time: 6 seconds
[1] "completed louvain kn 200 resolution 0.5"
[1] 9000
Warning: The following arguments are not used: reduction
Warning: The following arguments are not used: reduction
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 3086396
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.6851
Number of communities: 9
Elapsed time: 7 seconds
[1] "completed louvain kn 200 resolution 0.8"
[1] 9000
lou.test[[1]]
(table of intrinsic statistics)
The statistics differ on which clustering is best. Overall, the intrinsic statistics are not helpful for choosing cluster conditions for this data. We decided to use the heatmaps and UMAPs to pick the clustering that looks best for annotation: we want clusters that are distinct from each other on the UMAP and have clear markers on the heatmap. We then used the Rand Index to determine the final cluster conditions.
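The Rand Index comparison between two cluster assignments can be sketched in base R as follows (an illustrative implementation of the adjusted Rand Index, not CelltypeR's code):

```r
# Adjusted Rand Index between two cluster labelings of the same cells:
# 1 = identical partitions, ~0 = no better than chance
adjusted_rand <- function(a, b) {
  tab   <- table(a, b)                 # contingency table of the two labelings
  comb2 <- function(n) n * (n - 1) / 2 # pairs of items, "n choose 2"
  sum_ij <- sum(comb2(tab))
  sum_a  <- sum(comb2(rowSums(tab)))
  sum_b  <- sum(comb2(colSums(tab)))
  expected <- sum_a * sum_b / comb2(length(a))
  maxi     <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (maxi - expected)
}

adjusted_rand(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical partitions: 1
adjusted_rand(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same partition, relabelled: 1
```

Because the index is invariant to label permutations, it measures whether two runs group the same cells together, which is what matters when checking stability across iterations.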
Check cluster stability: find the best resolution/cluster number/k value by calculating the Rand Index and the standard deviation of the cluster number.
RI <- clust_stability(input = seu,
resolutions = c(0.1,0.25,0.5,0.8,1),
kn = 60,
n = 5, #number of iterations
save_to = NULL)
Computing nearest neighbor graph
Computing SNN
21:16:52 UMAP embedding parameters a = 0.9922 b = 1.112
21:16:52 Read 9000 rows and found 12 numeric columns
21:16:52 Using Annoy for neighbor search, n_neighbors = 60
21:16:52 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:16:52 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d452a41d1
21:16:52 Searching Annoy index using 1 thread, search_k = 6000
21:16:56 Annoy recall = 100%
21:16:57 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 60
21:16:58 Initializing from normalized Laplacian + noise (using irlba)
21:16:58 Commencing optimization for 500 epochs, with 756780 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:17:16 Optimization finished
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9133
Number of communities: 5
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9134
Number of communities: 5
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9133
Number of communities: 5
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9133
Number of communities: 5
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9133
Number of communities: 5
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8663
Number of communities: 7
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8663
Number of communities: 7
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8663
Number of communities: 7
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8663
Number of communities: 7
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8651
Number of communities: 7
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8197
Number of communities: 11
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8209
Number of communities: 10
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8202
Number of communities: 10
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8200
Number of communities: 10
Elapsed time: 2 seconds
(The Louvain block above repeats for each iteration at each resolution tested: every run used 9000 nodes and 862796 edges, took 2-3 seconds, and reached a maximum modularity between 0.7641 and 0.8208 with 10-12 communities.)
# n = 100 is recommended; only 5 runs were used here because the computation is slow
#Look at the RI results
plot_randindex(
rdf = RI$list,
cp = c("orange", "violet"),
view = c(0, 1) # zoom in on the x axis; this does not change the scales, just the viewable section
)
# The Rand index (RI) indicates whether cells are placed into the same cluster each time the clustering is repeated; 1 means the partitions are identical. We ran 10 iterations, so there are 9 RI values for each resolution, and 10 counts of the number of clusters (one per iteration). Too few clusters will not capture the true differences between groups; too many clusters are difficult to annotate.
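To make the metric concrete, here is a small base R sketch (not part of CelltypeR) of the Rand index between two cluster assignments: it is the proportion of cell pairs on which the two partitions agree, i.e. the pair is either co-clustered in both or separated in both.

```r
# Rand index: fraction of cell pairs treated the same way by both partitions
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)               # all unordered pairs of cells
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]   # pair co-clustered in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]   # pair co-clustered in partition b?
  mean(same_a == same_b)                     # agreement over all pairs
}

rand_index(c(1, 1, 2, 2), c(5, 5, 9, 9))  # identical partitions: returns 1
```

Note that cluster labels themselves do not matter, only which cells are grouped together.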
After checking the parameters, run the cluster function to add the clustering information into the Seurat object.
seu <- get_clusters(seu, method = "louvain",
df_input = NULL, # only needed if method = "flowsom"
k = 60, #k for flowsom or kn for Phenograph and Seurat Louvain
resolution = 1,
plots = TRUE,
save_plots = FALSE)
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 9000
Number of edges: 862796
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7641
Number of communities: 12
Elapsed time: 2 seconds
[1] "method is Louvain"
21:22:33 UMAP embedding parameters a = 0.4502 b = 1.076
21:22:33 Read 9000 rows and found 12 numeric columns
21:22:33 Using Annoy for neighbor search, n_neighbors = 60
21:22:33 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:22:34 Writing NN index file to temp file /var/folders/k4/khtkczkd5tn732ftjpwgtr240000gn/T//RtmpfahyaS/file1653d594723d0
21:22:34 Searching Annoy index using 1 thread, search_k = 6000
21:22:38 Annoy recall = 100%
21:22:38 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 60
21:22:40 Initializing from normalized Laplacian + noise (using irlba)
21:22:40 Commencing optimization for 500 epochs, with 756780 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:22:58 Optimization finished
Visualize expression on UMAP and with heat maps
AB <- c("CD24","CD56","CD29","CD15","CD184","CD133","CD71","CD44","GLAST","AQP4","HepaCAM", "CD140a","O4")
# this will let us see one marker at a time
for (i in AB) {
print(FeaturePlot(seu, features = i, min.cutoff = 'q1', max.cutoff = 'q97', label = TRUE))
}
Some more visualization of expression values
# summary heat map
# use function plotmean
length(unique(seu$RNA_snn_res.1))
[1] 12
# 12
# if we want to plot by cluster we need a vector from 0 to n-1 clusters
cluster.num <- c(0:11)
plotmean(plot_type = 'heatmap',seu = seu, group = 'RNA_snn_res.1', markers = AB,
var_names = cluster.num, slot = 'scale.data', xlab = "Cluster",
ylab = "Antibody")
Predict cell annotations with CAM (Correlation Assignment Method)
reference_path <- "/Users/rhalenathomas/GITHUB/CelltypeR/Data/ReferenceMatrix9celltypesOrdered.csv"
reference_data <- read.csv(reference_path)
cor <- find_correlation(df.input,
reference_data,
min_corr = 0.45,
min_diff = 0.05)
Error in topn(corr_ls, 2L) : could not find function "topn"
Visualize the CAM results
plot_corr(cor)
Apply correlation predictions to clusters and output a vector for annotation functions
# add the correlation predictions to the meta data
seu <- AddMetaData(object=seu, metadata=cor$cell.label, col.name = 'cor.labels')
# see the labels added
unique(seu$cor.labels)
seu.cluster = seu$RNA_snn_res.0.8
seu.labels = seu$cor.labels
# plot the cluster predictions
plot_lab_clust(seu, seu$RNA_snn_res.0.8, seu$cor.labels)
Run the get_annotation function to return a vector of annotations in the order of the clusters.
cor.ann <- get_annotation(seu, seu.cluster = seu$RNA_snn_res.0.8,
seu.label = seu$cor.labels, top_n = 3,
filter_out = c("Unknown","unknown","Mixed"),
Label = "CAM")
cor.ann
dim(cor.ann)
cor.ann <- cor.ann[1:22,]
dim(cor.ann)
unique(cor.ann$CAM)
length(cor.ann$CAM)
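The core idea of this kind of per-cluster annotation, assigning each cluster its most frequent predicted label, can be sketched in base R (a toy illustration, not the package internals):

```r
# toy data: a cluster id and a predicted label per cell
clusters <- c(0, 0, 0, 1, 1, 1)
labels   <- c("Neurons", "Neurons", "Astrocytes", "Astrocytes", "Astrocytes", "Mixed")

# most frequent label per cluster
tapply(labels, clusters, function(x) names(which.max(table(x))))
# cluster 0: "Neurons", cluster 1: "Astrocytes"
```

The top_n and filter_out arguments refine this basic majority vote by inspecting the top candidate labels and excluding uninformative ones.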
Use a trained Random Forest model to predict cell types. Training of the Random Forest model with an annotated data set is below.
# you must have a saved trained model from a data object annotated with the same markers
rf <- readRDS("/Users/rhalenathomas/Documents/Data/FlowCytometry/PhenoID/Analysis/PaperFigures/RFM_trained.11072022.Rds")
rfm.pred <- RFM_predict(seu, rf)
head(rfm.pred)
# add the predictions into the seurat object
seu <- AddMetaData(object=seu, metadata=rfm.pred$`predict(rf, df)`, col.name = 'rfm.labels')
# check that the data is added
table(seu$rfm.labels)
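The prediction step is conceptually a standard randomForest train/predict workflow. A minimal toy sketch (independent of CelltypeR; the marker names and labels here are made up):

```r
library(randomForest)

set.seed(1)
# toy expression matrix: 100 cells x 2 markers, with known cell-type labels
expr <- data.frame(CD24 = c(rnorm(50, 2), rnorm(50, -2)),
                   CD44 = rnorm(100))
celltype <- factor(rep(c("Neurons", "Astrocytes"), each = 50))

# train a forest and predict one label per cell
rf_toy <- randomForest(x = expr, y = celltype, ntree = 100)
pred <- predict(rf_toy, expr)
table(pred, celltype)
```

RFM_predict applies the same predict step to the marker expression of each cell in the query object.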
Get the annotation by cluster for the RFM
rfm.ann <- get_annotation(seu, seu$RNA_snn_res.0.8,seu$rfm.labels,
top_n = 3, filter_out = c("unknown","Unknown","Mixed","Mix"), Label = "RFM")
rfm.ann
#rfm.ann <- get_annotation(seu, seu$RNA_snn_res.0.8,seu$rfm.labels,
# top_n = 3, filter_out = FALSE, Label = "RFM")
rfm.ann
dim(rfm.ann)
Plot RFM predictions
plot_lab_clust(seu, seu.cluster = seu$RNA_snn_res.0.8, seu.labels = seu$rfm.labels, filter_out = c("unknown","Unknown","Mixed"))
Predicting cell types with Seurat label transfer using anchors
# takes in a seurat object with the labels added
# makes a dataframe with the count of predicted labels for each cluster
# input seurat object with the predicted labels in the meta data
# input the clusters meta data slot to be labels
# input the meta data slot with the labels (correlation, random forest, seurat predicted)
#need reference data object with labels
seu.r<- readRDS("/Users/rhalenathomas/Documents/Data/FlowCytometry/PhenoID/Analysis/PaperFigures/Seu9000annot.08072021.RDS")
# the output is a seurat object with the predicted annotations
seu <- seurat_predict(seu, seu.r, ref_id = 'subgroups', down.sample = 500, markers = AB)
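seurat_predict presumably wraps Seurat's standard anchor workflow; the direct Seurat calls would look roughly like this (a sketch, assuming seu.r stores its labels in 'subgroups'; the dims choice here is an assumption for the 13-marker panel):

```r
# sketch of Seurat anchor-based label transfer (not the CelltypeR wrapper)
anchors <- FindTransferAnchors(reference = seu.r, query = seu,
                               features = AB, dims = 1:10)
predictions <- TransferData(anchorset = anchors,
                            refdata = seu.r$subgroups, dims = 1:10)
seu <- AddMetaData(seu, metadata = predictions$predicted.id,
                   col.name = "anchor.pred")
```
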
# plot the seurat anchor predictions
# get the annotation table for the seurat anchor predictions
plot_lab_clust(seu, seu$RNA_snn_res.0.8, seu$seu.pred)
# to not filter anything use c()
seu.ann <- get_annotation(seu, seu$RNA_snn_res.0.8,seu$seu.pred,
top_n = 3, filter_out = c(), Label = "Seurat")
seu.ann
Get a consensus of cluster annotations and add the annotations to the Seurat object
# manual annotations
# quick cheat: use the Seurat labels to get a string of annotations
temp <- paste("'",as.character(seu.ann$Seurat), "'", collapse = ", ", sep = "" )
my_ann <- c('Unknown', 'Radial Glia 1', 'Astrocytes 1', 'Mixed', 'Neurons 1', 'Neurons 2', 'Epithelial', 'Astrocytes mature', 'NPC', 'Radial Glia 2', 'Radial Glia 3', 'Endothelial', 'Epithelial', 'Neurons 3', 'Neurons 1', 'Astrocytes 2', 'Neurons 1', 'Unknown', 'Oligodendrocytes', 'Stem-like 1', 'Neural stem', 'Epithelial')
cluster.n <- c(0:21)
man.ann <- data.frame(cluster.n, my_ann)
colnames(man.ann) <- c("Cluster","manual")
# needs to be factors
man.ann <- data.frame(lapply(man.ann, factor))
dim(man.ann)
dim(cor.ann)
dim(rfm.ann)
dim(seu.ann)
# the dataframes must be the same length
# the annotation function takes in a list of the annotation dataframes
ann.list <- list(man.ann, cor.ann, rfm.ann, seu.ann)
# annotate the seurat object
seu <- cluster_annotate(seu, ann.list,
annotation_name ="CellType",
to_label = "RNA_snn_res.0.8")
DimPlot(seu, group.by = "CellType")
Alternatively, just use the annotate function
seu <- annotate(seu, annotations = man.ann$manual, to_label = "RNA_snn_res.0.8", annotation_name = "MyCellType")
DimPlot(seu, group.by = "MyCellType")
#save with the annotations
saveRDS(seu,"/Users/rhalenathomas/Documents/Data/FlowCytometry/PhenoID/Analysis/testingLibrary/seu9000Feb24.RDS")
Compare groups
seu <- readRDS("/Users/rhalenathomas/Documents/Data/FlowCytometry/PhenoID/Analysis/testingLibrary/seu9000Feb24.RDS")
Add the variables into the seurat object
Genotype <- c("3450","3450","3450","AIW002","AIW002","AIW002","AJG001C","AJG001C","AJG001C")
ExDate <- c("0306","0317","0317","0306","0317","0317","0306","0317","0317")
Batch <- c("B","B","A","B","B","A","B","B","A")
Age <- c("273","284","263","273","284","263","273","284","263")
# add each variable as a metadata column by renaming the sample identities
meta.vars <- list(Genotype = Genotype, ExDate = ExDate,
                  Batch = Batch, DaysinCulture = Age)
for (v in names(meta.vars)) {
  Idents(seu) <- "Sample"
  cluster.ids <- meta.vars[[v]]         # vector with the new names, matched to the sample levels
  names(cluster.ids) <- levels(seu)     # get the levels
  seu <- RenameIdents(seu, cluster.ids) # rename
  seu[[v]] <- Idents(seu)               # add a new metadata slot
}
Plot some variables
# one plot
proportionplots(seu, seu.var = seu$Genotype, seu.lable = seu$CellType, groups = "Genotype")
# add colours
colours <- c("chocolate1","chocolate3","orange",
"lightsalmon", "pink","lightpink3",
"steelblue3","deepskyblue",
"plum3","purple",
"seagreen3","tomato4","burlywood3","grey90","brown",
"royalblue3", "tan4","yellowgreen")
proportionplots(seu, seu.var = seu$Genotype, seu.lable = seu$CellType, groups = "Genotype", my_colours = colours)
var.list <- list(seu$DaysinCulture,seu$Batch,seu$ExDate,seu$Genotype)
varnames <- c("Days in Culture", "Batch", "Experiment Date", "Genotype")
# plot all the variables of interest at once
plotproportions(seu, var.list = var.list, xgroup = seu$CellType, varnames = varnames, my_colours = c("blue","red"))
Heatmaps
# make sure the order is correct
celltypes <- c("unknown","radial glia 1", "astrocytes 1", "mixed","neurons 1",
"neurons 2", "epithelial", "astrocytes mature", "npc", "radial glia 2",
"radial glia 3", "endothelial","neurons 3", "astrocytes 2",
"oligodendrocytes", "stem-like 1","neural stem")
plotmean(plot_type = 'heatmap',seu = seu, group = 'CellType', markers = AB,
var_names = celltypes, slot = 'scale.data', xlab = "Cell Type",
ylab = "Antibody")
# dot plot
og.order <- c("unknown","radial glia 1", "astrocytes 1", "mixed","neurons 1",
"neurons 2", "epithelial", "astrocytes mature", "npc", "radial glia 2",
"radial glia 3", "endothelial","neurons 3", "astrocytes 2",
"oligodendrocytes", "stem-like 1","neural stem")
# make sure the terms are exactly the same and you don't miss any
new.order <- c("astrocytes 1","astrocytes 2","astrocytes mature",
"endothelial","epithelial", "mixed","neurons 1",
"neurons 2","neurons 3","neural stem","npc",
"oligodendrocytes",
"radial glia 1","radial glia 2","radial glia 3","stem-like 1",
"unknown")
new.order <- rev(new.order)
AB.order <- c("CD24","CD56","CD29","CD15","CD184","CD133","CD71","CD44","GLAST","AQP4","HepaCAM", "CD140a","O4")
plotmean(plot_type = 'dotplot',seu = seu, group = 'CellType', markers = AB,
var_names = celltypes, slot = 'scale.data', xlab = "Cell Type",
ylab = "Antibody", var1order = new.order, var2order = AB.order)
Statistics comparing groups
# prepare a dataframe for stats
# this function takes the annotated seurat object with all the variables already existing as metadata slots
# check what meta data slots exist in your object
colnames(seu@meta.data)
# run the function
var.names <- c("Sample","DaysinCulture", "Batch", "ExDate", "Genotype", "CellType")
df.for.stats <- Prep_for_stats(seu, marker_list = AB, variables = var.names,
marker_name = 'Marker')
# save the csv for later
write.csv(df.for.stats, "/Users/rhalenathomas/Documents/Data/FlowCytometry/PhenoID/Analysis/testingLibrary/df4statsFeb26.csv")
head(df.for.stats)
First we have three variables: the experimental variable, cell types, and markers. We want to compare the mean expression per group in two ways: for all antibodies together, in each cell type, between levels of a given variable such as Genotype (one-way ANOVA); and for each antibody separately, for a given variable, in each cell type (two-way ANOVA).
df.means <- df.for.stats %>% group_by(Sample, CellType, Marker) %>%
summarize(mean_value = mean(value), .groups = "drop")
head(df.means)
dim(df.means)
#%>% left_join(df.for.stats, by = c("Sample","CellType","Marker"))
df_means <- df.for.stats %>%
group_by(Sample, CellType, Marker) %>%
mutate(mean_value = mean(value)) %>%
distinct(Sample, CellType, Marker, mean_value, .keep_all = TRUE) %>% select(-value)
get_means <- function(df, group_cols, value_col) {
  df %>%
    group_by(across(all_of(group_cols))) %>%
    mutate(mean_value = mean(.data[[value_col]])) %>%
    distinct(across(all_of(group_cols)), mean_value, .keep_all = TRUE) %>%
    select(-all_of(value_col))
}
df_means <- get_means(df.for.stats, group_cols = c("Sample","CellType","Marker"), value_col = "value")
head(df_means)
dim(df_means)
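With the data in long format, the comparisons described above map directly onto base R's aov(). A toy sketch (the data frame here is simulated; the column names mirror df.for.stats):

```r
set.seed(7)
# simulated long-format expression data
toy <- data.frame(
  value    = rnorm(60),
  Genotype = rep(c("3450", "AIW002", "AJG001C"), each = 20),
  Marker   = rep(c("CD24", "CD44"), 30)
)

# one-way ANOVA: expression across genotypes (all markers pooled)
summary(aov(value ~ Genotype, data = toy))

# two-way ANOVA: genotype and marker effects, with interaction
summary(aov(value ~ Genotype * Marker, data = toy))
```

In the real analysis, each cell type would be subset and tested separately, as described above.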
Train a Random Forest model. This requires a labelled Seurat object. More cells will give higher accuracy but increase computation time; run on an HPC when using many cells.
rf <- RFM_train(seurate_object = seu,
AB_list = AB, annotations = seu$CellType3,
split = c(0.8,0.2),
downsample = 20000,
seed = 222,
mytry = c(1:10),
maxnodes = c(12: 25),
trees = c(250, 500, 1000,2000),
start_node = 15)
saveRDS(rf, file = paste0(output_path, "trainedRFMFeb14.Rds"))